Game of Thrones character deaths and killers

Group 5: Li-Yen Huang (lh36263), Kai Ng (yn2979)

Overview

In our project analyzing Game of Thrones character deaths, we aim to gain insight into the power dynamics of the series by studying the relationships between characters, the houses they belong to, and the individuals responsible for the killings. Our main goals are to identify patterns in character deaths throughout the series and in the internal and external conflicts between the houses of Westeros.

Goal

The project focuses on discovering interesting insights about the deaths in Game of Thrones through data analysis. We address the following questions:

  • Task 1: How do the deaths of prominent characters influence episode ratings?

  • Task 2: Which regions are targeted most often?

We will visualize our findings to present the information in a clear and engaging way for users.

Workflow

We started by downloading raw data from various sources and importing it into RStudio/Jupyter. Since the data came in multiple formats (CSV, JSON, and XLSX), we used the appropriate reader functions to load each file into the project database. Next, we cleaned, formatted, and validated the data, then merged it into a single, unified data frame. Finally, we queried this data frame to extract the information needed to visualize and identify trends.
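
As a minimal sketch of this workflow in pandas, with tiny in-memory stand-ins for the real CSV/XLSX files, the load, clean, and merge steps look like:

```python
import io
import pandas as pd

# Hypothetical stand-ins for two of the raw downloads (not the real files)
deaths_csv = io.StringIO("Name,Season,Episode\nWill,1,1\n")
episodes_csv = io.StringIO("Season,Episode,Rating\n1,1,9.1\n")

deaths = pd.read_csv(deaths_csv)
episodes = pd.read_csv(episodes_csv)

# Clean and standardize headers, then merge on the shared season/episode keys
deaths.columns = deaths.columns.str.lower()
episodes.columns = episodes.columns.str.lower()
unified = deaths.merge(episodes, on=["season", "episode"], how="left")
```

XLSX and JSON inputs follow the same pattern with pd.read_excel and pd.json_normalize before the merge.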

Database Design

Episode - Death: The episode and death data sets share two common columns, season and episode. This allows us to map each episode to the on-screen deaths that occurred within it.

Death - Character: Ideally, there should be a one-to-one mapping between the death data set and the character data set. However, we lacked sufficient data to establish clear relationships in all cases. Some characters in the show are unnamed, and the data set uses generic terms like “Stark Soldier” to refer to them, making the mapping ambiguous. As a result, we excluded entries where the mapping between the two data sets was uncertain.
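
A sketch of this exclusion step, using a hypothetical list of generic terms rather than the project's actual one:

```python
import pandas as pd

# Toy rows; the generic-name set below is illustrative, not the real list
deaths = pd.DataFrame({"name": ["Ned Stark", "Stark Soldier", "Wildling"]})
generic_terms = {"stark soldier", "wildling", "tribesman"}

# Keep only entries whose mapping to the character data set is unambiguous
named_deaths = deaths[~deaths["name"].str.lower().isin(generic_terms)]
```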

Death - Location/Region - Map Location: We planned to link death locations from the death data set to corresponding entries in the map location data set. However, the names did not have a direct one-to-one relationship. For example, “Castle Black” (death location) is part of “The Wall” (map location), but we lacked sufficient data to make these connections. To address this, we used AI to generate a look-up table. Since the AI-generated data might not be fully reliable, we manually verified a sample of the entries. Based on these checks, we implemented the look-up table to help us join the data sets.
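
The join through the look-up table can be sketched like this, with made-up sample rows standing in for the real tables:

```python
import pandas as pd

# Hypothetical fragments of the death data and the AI-generated look-up table
deaths = pd.DataFrame({"name": ["will", "ned stark"],
                       "location": ["castle black", "king's landing"]})
lookup = pd.DataFrame({"location": ["castle black", "king's landing"],
                       "region": ["the wall", "the crownlands"]})

# A left join keeps every death even if its location is missing from the table
deaths_regions = deaths.merge(lookup, on="location", how="left")
```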

Data Sources

We used five data sources, one of which was generated by AI. The remaining four were compiled by individual fans: David Murphy, Mohammad Reza Ghari, and Jeffrey Lancaster.

Death data

  • Description: This data set documents every on-screen death in Game of Thrones. It includes who killed whom, the method used, the location, and the episode in which the death occurred. The author applies specific criteria to determine what qualifies as an on-screen death; more details are available at the link.

  • Source: The data was collected by David Murphy.

  • Link: https://data.world/datasaurusrex/game-of-thones-deaths

  • Challenges:

    • Killers without houses: Some killers do not belong to any house. Initially, we considered replacing the value with ‘N/A’, but since it is reasonable for a killer to have no house affiliation, we decided to keep the value “None”.

    • Repeated names: Some victim names are repeated. This is because certain characters in the show do not have specific names and are referred to by general terms, such as “Tribesman” or “Wildling.” Despite this, each entry still represents a single death, so we retained these entries in the table.
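
The killers-without-houses decision above can be sketched as follows (toy rows, not the real data):

```python
import pandas as pd

# Hypothetical rows mirroring the killers-without-houses case
deaths = pd.DataFrame({"killer": ["White Walker", "Ned Stark"],
                       "killers_house": [pd.NA, "House Stark"]})

# Keep "None" rather than N/A for killers with no house affiliation
deaths["killers_house"] = deaths["killers_house"].fillna("None")
```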

Episode data

  • Description: This data set provides detailed episode-level data for Game of Thrones. It includes key information such as season and episode numbers, titles, release dates, IMDb ratings, vote counts, reviews, and key contributors like writers and actors.

  • Source: The data was collected by Mohammad Reza Ghari, who scraped it from IMDb and Wikipedia.

  • Link: https://www.kaggle.com/datasets/rezaghari/game-of-thrones

  • Challenges:

    • Non-standard date format: The data stores dates as strings such as “20-Apr-2014”, a format that tidyverse readers do not parse automatically. R can still handle it, but it requires an explicit as.Date() call with a matching format string to convert it into a proper Date object.
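
A small sketch of the same conversion, shown here in pandas with hypothetical date strings; the format string must match the data exactly:

```python
import pandas as pd

# Hypothetical release-date strings in the data set's day-month-year style
dates = pd.to_datetime(pd.Series(["17-Apr-2011", "24-Apr-2011"]),
                       format="%d-%b-%Y")  # use %y instead for two-digit years
```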

Location data

  • Description: This data set contains geographical coordinates for various regions on the Ice and Fire world map.

  • Source: The data set was compiled by Jeffrey Lancaster, who sourced information from the Game of Thrones Wiki and manually encoded each location.

  • URL: https://github.com/jeffreylancaster/game-of-thrones/blob/master/data/opening-locations.json

  • Challenges:

    • Non-rectangular data: The location data comes from a JSON file, which is not a rectangular data structure. It consists of two key-value pairs: “note” and “locations.” Fortunately, the “locations” entry itself is rectangular, making it easy to extract and use.
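
A miniature example of pulling out the rectangular “locations” part, sketched in pandas with a stand-in JSON string:

```python
import json
import pandas as pd

# A tiny stand-in for the structure of opening-locations.json
raw = json.loads('{"note": "map notes", "locations": '
                 '[{"name": "Astapor", "fx": 19.196, "fy": 17.342}]}')

# Only the "locations" array is rectangular, so flatten just that part
locations = pd.json_normalize(raw["locations"])
```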

Characters data

  • Description: This data set provides detailed insights into the Game of Thrones characters, including the total number of episodes they appeared in, along with their first and final appearance years on the show.

  • Source: The data was collected by Mohammad Reza Ghari, who scraped it from IMDb and Wikipedia.

  • URL: https://www.kaggle.com/datasets/rezaghari/game-of-thrones?select=characters_v4.csv

  • Challenges:

    • Repeated characters: The character data includes repeated entries for unnamed characters, such as “Stark Soldier.” Since these generic terms represent distinct individuals in the show, we decided to retain all entries in the data set.

Look-up table for mapping locations and regions

  • Description: The mapping relationship between death locations and map regions.

  • Source: Because no ready-made look-up table between death locations and map regions was available, we generated this data using the AI tool Perplexity. We selected Perplexity because it can use web searches to generate answers and provides references for its responses, which improves accuracy and keeps the information up to date.

  • Note: Since the prompting process relies on the location and death data, it is explained in detail in the data import section.

  • Challenges:

    • Refine AI Answer: While AI is a powerful tool for accessing data we may lack, it can sometimes produce inaccurate results. To improve accuracy, we initially told Perplexity, “If a location doesn’t match any listed regions, leave it blank.” However, the output only matched locations to regions when their names were nearly identical, leaving many entries blank. This approach did not produce good results. We then revised our prompt, asking the AI to infer the best matches based on its knowledge. To verify these results, we manually reviewed a sample of matches by searching for location names online and referring to sources such as the Wiki of Westeros fandom site. The results seemed to fit well. Although this method isn’t entirely reliable, the answer seemed good enough to meet our needs.

      First Prompt:

      Categorize each of the following Game of Thrones locations into their respective regions and present the output in a table. If a location does not belong to any of the listed regions, leave it blank.

      ……

      Result:

      Refined Prompt:

      Categorize each of the following Game of Thrones locations into their respective regions and present the output in a table. If a location does not have a direct one-to-one mapping, use your best knowledge to infer its region.

      ……

      Result:

Import Data

Environment Setup

First, we imported the necessary libraries for our project.

library(tidyverse)
library(readxl)    # read data from Excel files
library(naniar)    # handle missing data
library(ggplot2)   # plotting (also attached by tidyverse)
library(jsonlite)  # read JSON files
library(magick)    # image-processing toolkit
library(lubridate) # parse and manipulate dates

Connecting to DuckDB

Then, we loaded the DuckDB library to connect to the local DuckDB database.

library(DBI)
library(duckdb)
options(duckdb.enable_rstudio_connection_pane=TRUE)

# create / connect to database file
drv <- duckdb()
con <- dbConnect(drv)

We then installed the reticulate package to enable running Python code in R, along with the required Python packages.

library(reticulate)
py_install("pandas")
py_install("matplotlib")
py_install("requests")
py_install("openpyxl")

# Python imports used throughout the analysis
import pandas as pd
from io import StringIO, BytesIO
import re
import requests
from PIL import Image, ImageEnhance
import matplotlib.pyplot as plt
import numpy as np

Import Death Data

We loaded data from the ‘Game of Thrones Deaths.xlsx’ file. To avoid the program automatically inferring data types and potentially introducing ambiguity, we explicitly specified the col_types parameter in read_excel. Furthermore, we renamed the columns to maintain consistent and uniform header naming conventions.

death_data_type = c("text", "text", "numeric", "numeric", "text", "text", "text", "text", "numeric")

got_deaths = read_excel( "data/Game of Thrones Deaths.xlsx", col_types = death_data_type ) |>
  janitor::clean_names()

got_deaths

Validation Queries

# Check row count
got_deaths |> 
  summarize(row_count = n())
# Check data types
got_deaths |> 
  glimpse()
Rows: 2,224
Columns: 9
$ name          <chr> "Waymar Royce", "Gared", "Will", "Stag", "Direwolf", "Jo…
$ allegiance    <chr> "Night's Watch", "Night's Watch", "Night's Watch", "None…
$ season        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode       <dbl> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 4, 5, 5, 5, 5, 5, 5, 5, 5,…
$ location      <chr> "Beyond the Wall", "Beyond the Wall", "Winterfell", "Win…
$ killer        <chr> "White Walker", "White Walker", "Ned Stark", "Direwolf",…
$ killers_house <chr> "None", "None", "House Stark", "None", "None", "House Ar…
$ method        <chr> "Ice sword", "Ice sword", "Sword", "Teeth", "Antler", "P…
$ death_no      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
# Check missing values
got_deaths |> 
  summarize(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}"))
# Check for duplicates
got_deaths |> 
  summarize(duplicate_count = sum(duplicated(got_deaths)))

We loaded data from the ‘Game of Thrones Deaths.xlsx’ file using the st_read function. This method directly imported the data into our DuckDB database, creating the raw_deaths table.

-- Extension to read data from an Excel file
INSTALL spatial;
LOAD spatial;

CREATE OR REPLACE TEMP TABLE raw_deaths AS
SELECT * FROM st_read('data/Game of Thrones Deaths.xlsx');

To create the got_deaths table, we transformed the original raw_deaths dataset by normalizing key fields and ensuring consistency in data types. We generated normalized columns (normalized_name, normalized_allegiance, and normalized_location) by converting text to lowercase and trimming any extra whitespace, which helps standardize the data for efficient joins and comparisons. We also cast several fields to appropriate data types, such as season and episode to INTEGER, and other descriptive fields like name, allegiance, and method to TEXT.

CREATE OR REPLACE TEMP TABLE got_deaths AS
SELECT 
    LOWER(TRIM(name)) AS normalized_name,
    LOWER(TRIM(allegiance)) AS normalized_allegiance,
    CAST(season AS INTEGER) AS season,
    CAST(episode AS INTEGER) AS episode,
    LOWER(TRIM(location)) AS normalized_location,
    CAST(killer AS TEXT) AS killer,
    CAST("killers house" AS TEXT) AS killers_house,
    CAST(method AS TEXT) AS method,
    CAST("death no." AS INTEGER) AS death_no
FROM raw_deaths;

Validation Queries

-- Check row count
SELECT COUNT(*) AS row_count
FROM got_deaths;
1 records
row_count
2796
-- Check data types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'got_deaths';
9 records
column_name data_type is_nullable
normalized_name VARCHAR YES
normalized_allegiance VARCHAR YES
season INTEGER YES
episode INTEGER YES
normalized_location VARCHAR YES
killer VARCHAR YES
killers_house VARCHAR YES
method VARCHAR YES
death_no INTEGER YES
-- Check missing values
SELECT column_name, COUNT(*) - COUNT(column_name) AS missing_count
FROM information_schema.columns
WHERE table_name = 'got_deaths'
GROUP BY column_name;
9 records
column_name missing_count
normalized_name 0
killers_house 0
normalized_allegiance 0
season 0
episode 0
killer 0
normalized_location 0
death_no 0
method 0
-- Check for duplicates
SELECT 
COUNT(*) - COUNT(DISTINCT (normalized_name, normalized_allegiance, season, episode, normalized_location, killer, killers_house, method, death_no)) AS duplicate_count
FROM got_deaths;

-- Duplicates occur because a generic character name can appear multiple times in a single episode, or because a character was played by different actors within an episode
1 records
duplicate_count
571

Once the table was created, we tested it to ensure it was functioning as expected.

SELECT *
FROM got_deaths;
Displaying records 1 - 10
normalized_name normalized_allegiance season episode normalized_location killer killers_house method death_no
waymar royce night’s watch 1 1 beyond the wall White Walker None Ice sword 1
gared night’s watch 1 1 beyond the wall White Walker None Ice sword 2
will night’s watch 1 1 winterfell Ned Stark House Stark Sword 3
stag none 1 1 winterfell Direwolf None Teeth 4
direwolf none 1 1 winterfell Stag None Antler 5
jon arryn house arryn 1 1 king’s landing Lysa Arryn House Arryn Poison 6
dothraki man dothraki 1 1 pentos Dothraki man Dothraki Arakh 7
catspaw assassin none 1 2 winterfell Summer House Stark Teeth 8
mycah smallfolk 1 2 kingsroad Sandor “the Hound” Clegane House Lannister Sword 9
lady house stark 1 2 kingsroad Ned Stark House Stark Knife 10

We loaded data from the ‘Game of Thrones Deaths.xlsx’ file. To avoid Python automatically inferring data types and potentially introducing ambiguity, we explicitly specified the dtype parameter in read_excel. Furthermore, we renamed the columns to maintain consistent and uniform header naming conventions.

death_type_spec = {
    'Name': 'string',
    'Allegiance': 'string',
    'Season': 'int32',
    'Episode': 'int32',
    'Location': 'string',
    'Killer': 'string',
    'Killers House': 'string',
    'Method': 'string',
    'Death No.': 'int32'
}

# Read data from excel file
raw_deaths = pd.read_excel(open('data/Game of Thrones Deaths.xlsx', 'rb'), sheet_name = 'Game of Thrones Deaths collecti', dtype = death_type_spec)  


# Clean the column names: convert to lowercase, replace spaces with '_', and remove special characters
got_deaths = (
    raw_deaths.rename(columns = str.lower)
              .rename(columns = lambda x: x.replace( ' ', '_' ))
              .rename(columns = lambda x: re.sub('[^a-z0-9_]', '', x ))
)

got_deaths
                               name  ... death_no
0                      Waymar Royce  ...        1
1                             Gared  ...        2
2                              Will  ...        3
3                              Stag  ...        4
4                          Direwolf  ...        5
...                             ...  ...      ...
2219  Gregor "the Mountain" Clegane  ...     2220
2220     Sandor "the Hound" Clegane  ...     2221
2221                Jaime Lannister  ...     2222
2222               Cersei Lannister  ...     2223
2223             Daenerys Targaryen  ...     2224

[2224 rows x 9 columns]

Validation Queries

# Check row count
len(got_deaths)
2224
# Check data types for each column
got_deaths.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2224 entries, 0 to 2223
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           2224 non-null   string
 1   allegiance     1792 non-null   string
 2   season         2224 non-null   int32 
 3   episode        2224 non-null   int32 
 4   location       2224 non-null   string
 5   killer         2214 non-null   string
 6   killers_house  2194 non-null   string
 7   method         2224 non-null   string
 8   death_no       2224 non-null   int32 
dtypes: int32(3), string(6)
memory usage: 130.4 KB
# Check missing values
got_deaths.isnull().sum()
name               0
allegiance       432
season             0
episode            0
location           0
killer            10
killers_house     30
method             0
death_no           0
dtype: int64
# Check for duplicates
got_deaths.duplicated().sum()
np.int64(0)

Note: The missing-value validation results differ slightly in Python because pandas’s isnull() function treats None as a null value.
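
This behavior can be illustrated with a small sketch:

```python
import pandas as pd

# The string "None" is data; Python None and NaN are treated as missing
s = pd.Series(["None", None, float("nan")], dtype="object")
missing = int(s.isnull().sum())  # counts None and NaN, not the string "None"
```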

Import Episode Data

The episode data is imported from the ‘GOT_episodes_v4.csv’ file. Following the same process as with the death data, we explicitly specified the data types for tidyverse to prevent inaccurate type guessing. Additionally, we formatted date strings, such as “20-Apr-2014,” into date types.

episode_data_col_spec = cols(
  Season = col_double(),
  Episode = col_double(),
  Title = col_character(),
  Release_date = col_character(),
  Rating = col_double(),
  Votes = col_double(),
  Summary = col_character(),
  Writer_1 = col_character(),
  Writer_2 = col_character(),
  Star_1 = col_character(),
  Star_2 = col_character(),
  Star_3 = col_character(),
  Users_reviews = col_double(),
  Critics_reviews = col_double(),
  US_Viewers = col_double(),
  Duration = col_double(),
  Director = col_character(),
  Budget_estimate = col_double()
)

raw_episodes = read_csv("data/GOT_episodes_v4.csv", col_types = episode_data_col_spec) |>
  janitor::clean_names()

got_episodes <- raw_episodes |>
    mutate( release_date = as.Date(release_date, format = "%d-%b-%y") )

got_episodes

Validation Queries

# Check row count
raw_episodes |> 
  summarize(row_count_match = n() == nrow(got_episodes))
# Check data types
got_episodes |> 
  glimpse()
Rows: 73
Columns: 18
$ season          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, …
$ episode         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8,…
$ title           <chr> "Winter Is Coming", "The Kingsroad", "Lord Snow", "Cri…
$ release_date    <date> 2011-04-17, 2011-04-24, 2011-05-01, 2011-05-08, 2011-…
$ rating          <dbl> 9.1, 8.8, 8.7, 8.8, 9.1, 9.2, 9.2, 9.0, 9.6, 9.5, 8.8,…
$ votes           <dbl> 38639, 29285, 27694, 26284, 27349, 27079, 27556, 25645…
$ summary         <chr> "Eddard Stark is torn between his family and an old fr…
$ writer_1        <chr> "David Benioff", "David Benioff", "David Benioff", "Da…
$ writer_2        <chr> "D.B. Weiss", "D.B. Weiss", "D.B. Weiss", "D.B. Weiss"…
$ star_1          <chr> "Sean Bean", "Sean Bean", "Sean Bean", "Sean Bean", "S…
$ star_2          <chr> "Mark Addy", "Mark Addy", "Mark Addy", "Mark Addy", "M…
$ star_3          <chr> "Nikolaj Coster-Waldau", "Nikolaj Coster-Waldau", "Nik…
$ users_reviews   <dbl> 61, 27, 21, 22, 24, 23, 21, 20, 30, 34, 22, 19, 15, 19…
$ critics_reviews <dbl> 30, 28, 29, 26, 26, 26, 27, 26, 29, 29, 29, 25, 25, 24…
$ us_viewers      <dbl> 2.22, 2.20, 2.44, 2.45, 2.58, 2.44, 2.40, 2.72, 2.66, …
$ duration        <dbl> 62, 56, 58, 56, 55, 53, 58, 59, 57, 53, 53, 54, 53, 51…
$ director        <chr> "Timothy Van Patten", "Timothy Van Patten", "Brian Kir…
$ budget_estimate <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# Check missing values
got_episodes |> 
  summarize(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}"))
#Check for duplicates
raw_episodes |> 
  summarize(duplicate_count = sum(duplicated(raw_episodes)))

The episode data is imported from the ‘GOT_episodes_v4.csv’ file by directly reading the data file into our DuckDB database, creating the raw_episodes table.

CREATE OR REPLACE TEMP TABLE raw_episodes AS
SELECT * FROM 'data/GOT_episodes_v4.csv';

We then created the got_episodes table to transform the original raw_episodes dataset into a more structured format. We cast several columns to their appropriate data types: season and episode to INTEGER, and rating, us_viewers, and budget_estimate to FLOAT. We also normalized text fields such as title, summary, writers, stars, and director by converting them to lowercase for consistency. The release_date was parsed with STRPTIME to convert it into a proper date format.

CREATE OR REPLACE TEMP TABLE got_episodes AS
SELECT 
    CAST(season AS INTEGER) AS season,
    CAST(episode AS INTEGER) AS episode,
    LOWER(title) AS title,
    STRPTIME(release_date, '%d-%b-%y') AS release_date,
    CAST(rating AS FLOAT) AS rating,
    CAST(votes AS INTEGER) AS votes,
    LOWER(summary) AS summary,
    LOWER(writer_1) AS writer_1,
    LOWER(writer_2) AS writer_2,
    LOWER(star_1) AS star_1,
    LOWER(star_2) AS star_2,
    LOWER(star_3) AS star_3,
    CAST(users_reviews AS INTEGER) AS users_reviews,
    CAST(critics_reviews AS INTEGER) AS critics_reviews,
    CAST(us_viewers AS FLOAT) AS us_viewers,
    CAST(duration AS FLOAT) AS duration,
    LOWER(director) AS director,
    CAST(budget_estimate AS FLOAT) AS budget_estimate
FROM raw_episodes;

Validation Queries

-- Check row count
SELECT COUNT(*) AS row_count
FROM got_episodes;
1 records
row_count
73
-- Check data types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'got_episodes';
Displaying records 1 - 10
column_name data_type is_nullable
season INTEGER YES
episode INTEGER YES
title VARCHAR YES
release_date TIMESTAMP YES
rating FLOAT YES
votes INTEGER YES
summary VARCHAR YES
writer_1 VARCHAR YES
writer_2 VARCHAR YES
star_1 VARCHAR YES
-- Check missing values
SELECT column_name, COUNT(*) - COUNT(column_name) AS missing_count
FROM information_schema.columns
WHERE table_name = 'got_episodes'
GROUP BY column_name;
Displaying records 1 - 10
column_name missing_count
summary 0
star_2 0
star_3 0
writer_1 0
director 0
title 0
release_date 0
votes 0
duration 0
season 0
-- Check for duplicates
SELECT 
COUNT(*) - COUNT(DISTINCT (season, episode, title, release_date, rating, votes, summary, writer_1, writer_2, star_1, star_2, star_3, users_reviews, critics_reviews, us_viewers, duration, director, budget_estimate)) AS duplicate_count
FROM got_episodes;
1 records
duplicate_count
0

Once the table was created, we tested it to ensure it was functioning as expected.

SELECT *
FROM got_episodes;
Displaying records 1 - 10
season episode title release_date rating votes summary writer_1 writer_2 star_1 star_2 star_3 users_reviews critics_reviews us_viewers duration director budget_estimate
1 1 winter is coming 2011-04-17 9.1 38639 eddard stark is torn between his family and an old friend when asked to serve at the side of king robert baratheon; viserys plans to wed his sister to a nomadic warlord in exchange for an army. david benioff d.b. weiss sean bean mark addy nikolaj coster-waldau 61 30 2.22 62 timothy van patten 0
1 2 the kingsroad 2011-04-24 8.8 29285 while bran recovers from his fall, ned takes only his daughters to king’s landing. jon snow goes with his uncle benjen to the wall. tyrion joins them. david benioff d.b. weiss sean bean mark addy nikolaj coster-waldau 27 28 2.20 56 timothy van patten 0
1 3 lord snow 2011-05-01 8.7 27694 jon begins his training with the night’s watch; ned confronts his past and future at king’s landing; daenerys finds herself at odds with viserys. david benioff d.b. weiss sean bean mark addy nikolaj coster-waldau 21 29 2.44 58 brian kirk 0
1 4 cripples, bastards, and broken things 2011-05-08 8.8 26284 eddard investigates jon arryn’s murder. jon befriends samwell tarly, a coward who has come to join the night’s watch. david benioff d.b. weiss sean bean mark addy nikolaj coster-waldau 22 26 2.45 56 brian kirk 0
1 5 the wolf and the lion 2011-05-15 9.1 27349 catelyn has captured tyrion and plans to bring him to her sister, lysa arryn, at the vale, to be tried for his, supposed, crimes against bran. robert plans to have daenerys killed, but eddard refuses to be a part of it and quits. david benioff d.b. weiss sean bean mark addy nikolaj coster-waldau 24 26 2.58 55 brian kirk 0
1 6 a golden crown 2011-05-22 9.2 27079 while recovering from his battle with jaime, eddard is forced to run the kingdom while robert goes hunting. tyrion demands a trial by combat for his freedom. viserys is losing his patience with drogo. david benioff d.b. weiss sean bean mark addy michelle fairley 23 26 2.44 53 daniel minahan 0
1 7 you win or you die 2011-05-29 9.2 27556 robert has been injured while hunting and is dying. jon and the others finally take their vows to the night’s watch. a man, sent by robert, is captured for trying to poison daenerys. furious, drogo vows to attack the seven kingdoms. david benioff d.b. weiss sean bean mark addy nikolaj coster-waldau 21 27 2.40 58 daniel minahan 0
1 8 the pointy end 2011-06-05 9.0 25645 the lannisters press their advantage over the starks; robb rallies his father’s northern allies and heads south to war; the white walkers attack the wall; tyrion returns to his father with some new friends. david benioff d.b. weiss sean bean michelle fairley lena headey 20 26 2.72 59 daniel minahan 0
1 9 baelor 2011-06-12 9.6 36217 robb goes to war against the lannisters. jon finds himself struggling on deciding if his place is with robb or the night’s watch. drogo has fallen ill from a fresh battle wound. daenerys is desperate to save him. david benioff d.b. weiss sean bean michelle fairley nikolaj coster-waldau 30 29 2.66 57 alan taylor 0
1 10 fire and blood 2011-06-19 9.5 31789 robb vows to get revenge on the lannisters. jon must officially decide if his place is with robb or the night’s watch. daenerys says her final goodbye to drogo. david benioff d.b. weiss sean bean michelle fairley nikolaj coster-waldau 34 29 3.04 53 alan taylor 0

The episode data is imported from the ‘GOT_episodes_v4.csv’ file. Following the same process as with the death data, we explicitly specified the dtype to prevent inaccurate type guessing. Additionally, we formatted date strings, such as “20-Apr-2014,” into date types.

episode_type_spec = {
    'Season': 'int32',
    'Episode': 'int32',
    'Title': 'string',
    'Release_date': 'string',
    'Rating': 'float64',
    'Votes': 'int32',
    'Summary': 'string',
    'Writer_1': 'string',
    'Writer_2': 'string',
    'Star_1': 'string',
    'Star_2': 'string',
    'Star_3': 'string',
    'Users_reviews': 'int32',
    'Critics_reviews': 'int32',
    'US_Viewers': 'float64',
    'Duration': 'float64',
    'Director': 'string',
    'Budget_estimate': 'float64'
}

# Read the CSV file
raw_episodes = pd.read_csv(
    "data/GOT_episodes_v4.csv",
    dtype=episode_type_spec
)

got_episodes = (
        raw_episodes.rename( columns=str.lower )
                    .assign( release_date=lambda df: pd.to_datetime( df['release_date'], format='%d-%b-%y' ) )
)

got_episodes
    season  episode  ...            director budget_estimate
0        1        1  ...  Timothy Van Patten             0.0
1        1        2  ...  Timothy Van Patten             0.0
2        1        3  ...          Brian Kirk             0.0
3        1        4  ...          Brian Kirk             0.0
4        1        5  ...          Brian Kirk             0.0
..     ...      ...  ...                 ...             ...
68       8        2  ...        David Nutter             0.0
69       8        3  ...    Miguel Sapochnik             0.0
70       8        4  ...        David Nutter             0.0
71       8        5  ...    Miguel Sapochnik             0.0
72       8        6  ...       David Benioff             0.0

[73 rows x 18 columns]

Validation Queries

# Check data types for each column
got_episodes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73 entries, 0 to 72
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   season           73 non-null     int32         
 1   episode          73 non-null     int32         
 2   title            73 non-null     string        
 3   release_date     73 non-null     datetime64[ns]
 4   rating           73 non-null     float64       
 5   votes            73 non-null     int32         
 6   summary          73 non-null     string        
 7   writer_1         73 non-null     string        
 8   writer_2         73 non-null     string        
 9   star_1           73 non-null     string        
 10  star_2           73 non-null     string        
 11  star_3           73 non-null     string        
 12  users_reviews    73 non-null     int32         
 13  critics_reviews  73 non-null     int32         
 14  us_viewers       73 non-null     float64       
 15  duration         73 non-null     float64       
 16  director         73 non-null     string        
 17  budget_estimate  73 non-null     float64       
dtypes: datetime64[ns](1), float64(4), int32(5), string(8)
memory usage: 9.0 KB
# Check missing values
got_episodes.isnull().sum()
season             0
episode            0
title              0
release_date       0
rating             0
votes              0
summary            0
writer_1           0
writer_2           0
star_1             0
star_2             0
star_3             0
users_reviews      0
critics_reviews    0
us_viewers         0
duration           0
director           0
budget_estimate    0
dtype: int64
# Check for duplicates
(got_episodes.groupby(['season', 'episode'])
  .size()
  .reset_index(name='count')
  .query('count > 1'))
Empty DataFrame
Columns: [season, episode, count]
Index: []

Import Map Location Data

The map location data is imported from the ‘opening-locations.json’ file, which contains two key-value pairs: note and locations. We processed each individually. From the note key-value pair, we extracted the URL of the map image using str_extract() with a regex pattern. For the locations key-value pair, we extracted the data into a location data frame. We then examined this data frame to confirm it has a rectangular structure.

raw_map_data <- fromJSON("data/opening-locations.json")

got_map_url <- raw_map_data$note |>
  str_extract( "https://[^ ]+" )

got_map_url
[1] "https://vignette.wikia.nocookie.net/iceandfire/images/3/37/Ice_and_Fire_World_Map.png/revision/latest?cb=20130127004523"
# Transform data into a rectangular format
tibble(locations = raw_map_data$locations) |>
  unnest_wider(locations, ptype = list(
      name = character(),
      fx = numeric(),
      fy = numeric()
      )) -> got_locations

got_locations

Validation Queries

# Ensure `raw_map_data` is a data frame
raw_map_data <- as.data.frame(raw_map_data)

# Check row count
raw_map_data |> 
  summarize(row_count_match = n() == nrow(got_locations))
# Check data types
got_locations |> 
  glimpse()
Rows: 22
Columns: 3
$ name <chr> "Astapor", "Braavos", "Dorne", "Dragonstone", "Dreadfort", "Eastw…
$ fx   <dbl> 19.196, 10.342, 8.416, 8.311, 7.280, 7.395, 6.218, 6.697, 7.025, …
$ fy   <dbl> 17.342, 9.363, 16.529, 11.967, 5.999, 3.822, 11.519, 12.759, 4.91…
# Check missing values
got_locations |> 
  summarize(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}"))
# Check for duplicates
got_locations |> 
  summarize(duplicate_count = sum(duplicated(got_locations)))
# Check if the location dataframe is rectangular
# We verify that every column has the same number of rows
got_locations |> 
  summarize(is_rectangular = all(across(everything(), ~ length(.) == nrow(got_locations))))

The map location data is imported from the ‘opening-locations.json’ file, which contains two key-value pairs: note and locations. We processed each individually, first loading the JSON data into a table named raw_map_data using the read_json_auto function. The locations field, an array of nested JSON objects, was then flattened with the UNNEST function to create one row per object in the array. The ->> operator extracts specific values (name, fx, and fy) from the JSON objects as plain text. These extracted fields were transformed and stored in a new table named locations, and the note field was extracted directly into another table named notes.

--Extension to read data from a JSON file
INSTALL json;
LOAD json;

CREATE OR REPLACE TEMP TABLE raw_map_data AS
SELECT *
FROM read_json_auto('data/opening-locations.json');
  
CREATE OR REPLACE TEMP TABLE locations AS
SELECT
    location.value->>'name' AS name, --Extracts the 'name' field as text
    CAST(location.value->>'fx' AS DOUBLE) AS fx, --Extracts 'fx' and casts it to DOUBLE
    CAST(location.value->>'fy' AS DOUBLE) AS fy --Extracts 'fy' and casts it to DOUBLE
FROM raw_map_data,
UNNEST(locations) AS location(value);

CREATE OR REPLACE TEMP TABLE notes AS
SELECT note
FROM raw_map_data;

Map URL

We created the got_map_url table to extract the map URL from the notes table. Using the REGEXP_EXTRACT function, we identified and extracted the URL pattern present in the note field.

CREATE OR REPLACE TEMP TABLE got_map_url AS
SELECT 
    REGEXP_EXTRACT(note, 'https://[^ ]+') AS map_url
FROM notes;

Once the table was created, we tested it to ensure it was functioning as expected.

SELECT * 
FROM got_map_url;
1 records
map_url
https://vignette.wikia.nocookie.net/iceandfire/images/3/37/Ice_and_Fire_World_Map.png/revision/latest?cb=20130127004523

Locations

Next, we created the got_locations table to store information about the map locations. We used LOWER(TRIM(name)) to create a normalized_name column, ensuring consistency by converting the location names to lowercase and removing any extra whitespace. Additionally, we cast the coordinates (fx and fy) as FLOAT and renamed them loc_x and loc_y.

CREATE OR REPLACE TEMP TABLE got_locations AS
SELECT 
    LOWER(TRIM(name)) AS normalized_name,
    CAST(fx AS FLOAT) AS loc_x,
    CAST(fy AS FLOAT) AS loc_y
FROM locations;

Validation Queries

-- Check row count
SELECT COUNT(*) AS row_count
FROM got_locations;
1 records
row_count
22
-- Check data types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'got_locations';
3 records
column_name data_type is_nullable
normalized_name VARCHAR YES
loc_x FLOAT YES
loc_y FLOAT YES
-- Check missing values
SELECT 
    COUNT(*) - COUNT(normalized_name) AS missing_normalized_name,
    COUNT(*) - COUNT(loc_x) AS missing_loc_x,
    COUNT(*) - COUNT(loc_y) AS missing_loc_y
FROM got_locations;
1 records
missing_normalized_name missing_loc_x missing_loc_y
0 0 0
-- Check for duplicates
SELECT 
COUNT(*) - COUNT(DISTINCT (normalized_name, loc_x, loc_y)) AS duplicate_count
FROM got_locations;
1 records
duplicate_count
0

Once the table was created, we tested it to ensure it was functioning as expected.

SELECT * 
FROM got_locations;
Displaying records 1 - 10
normalized_name loc_x loc_y
astapor 19.196 17.342
braavos 10.342 9.363
dorne 8.416 16.529
dragonstone 8.311 11.967
dreadfort 7.280 5.999
eastwatch 7.395 3.822
harrenhal 6.218 11.519
king’s landing 6.697 12.759
last hearth 7.025 4.911
meereen 19.935 15.592

The map location data is imported from the ‘opening-locations.json’ file, which contains two key-value pairs: note and locations. We processed each key individually. From the note key-value pair, we extracted the URL of the map image using a regex. For the locations key-value pair, we extracted the data into a location data frame. Since each entry in locations is a dictionary with multiple fields, we used apply to pull each dictionary’s fields into separate columns, one value per row.

raw_map = pd.read_json('data/opening-locations.json')

# Retrieve map url from note
got_map_url = re.findall( 'https://[^ ]+', raw_map['note'][0] )

got_map_url
['https://vignette.wikia.nocookie.net/iceandfire/images/3/37/Ice_and_Fire_World_Map.png/revision/latest?cb=20130127004523']
got_locations = (
    raw_map.assign(
        name = lambda x: x['locations'].apply(lambda d: d['name']),
        fx = lambda x: x['locations'].apply(lambda d: d['fx']),
        fy = lambda x: x['locations'].apply(lambda d: d['fy'])
    )
    .drop( columns = ['note', 'locations'] )
)

got_locations
              name      fx      fy
0          Astapor  19.196  17.342
1          Braavos  10.342   9.363
2            Dorne   8.416  16.529
3      Dragonstone   8.311  11.967
4        Dreadfort   7.280   5.999
5        Eastwatch   7.395   3.822
6        Harrenhal   6.218  11.519
7   King's Landing   6.697  12.759
8      Last Hearth   7.025   4.911
9          Meereen  19.935  15.592
10     Moat Cailin   5.947   8.145
11         Oldtown   3.114  15.633
12          Pentos  10.395  12.728
13            Pyke   3.229  10.572
14           Qarth  25.580  18.685
15        Riverrun   5.051  11.061
16       The Eyrie   7.030  10.426
17       The Twins   5.260   9.811
18        The Wall   6.770   3.822
19    Vaes Dothrak  23.726  11.686
20      Winterfell   5.666   6.218
21          Yunkai  19.550  16.102

Validation Queries

# Check data types for each column
got_locations.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    22 non-null     object 
 1   fx      22 non-null     float64
 2   fy      22 non-null     float64
dtypes: float64(2), object(1)
memory usage: 656.0+ bytes
# Check missing values
got_locations.isnull().sum()
name    0
fx      0
fy      0
dtype: int64
# Check for duplicates
got_locations.duplicated().sum()
np.int64(0)
# Check if the location dataframe is rectangular
# We verify that every column has the same number of rows
(got_locations.apply(len)
  .eq(len(got_locations))
  .all())
np.True_

Import Characters Data

The character data is imported from the ‘characters_v4.csv’ file. We followed the same workflow as before, explicitly setting the column types to prevent readr from guessing them. Additionally, we used the janitor::clean_names() function to clean and standardize the column names.

character_col_spec = cols(
  Character = col_character(),
  `Actor/ess` = col_character(),
  Episodes_appeared = col_integer(),
  First_appearance = col_integer(),
  Last_appearance = col_integer()
)

got_characters <- read_csv( "data/characters_v4.csv", col_types = character_col_spec ) |>
  janitor::clean_names()

got_characters

Validation Queries

#Check row count
got_characters |> 
  summarize(row_count = n())
#Check data types
got_characters |> 
  glimpse()
Rows: 832
Columns: 5
$ character         <chr> "Tyrion Lannister", "Cersei Lannister", "Daenerys Ta…
$ actor_ess         <chr> "Peter Dinklage", "Lena Headey", "Emilia Clarke", "K…
$ episodes_appeared <int> 67, 62, 62, 62, 59, 59, 55, 52, 48, 47, 46, 42, 42, …
$ first_appearance  <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
$ last_appearance   <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019…
#Check missing values
got_characters |> 
  summarize(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}"))
#Check for duplicates
got_characters |> 
  summarize(duplicate_count = sum(duplicated(got_characters)))

The character data is imported from the ‘characters_v4.csv’ file. We followed the same workflow as before, reading the data file directly into our DuckDB database to create the characters table.

CREATE OR REPLACE TEMP TABLE characters AS
SELECT * FROM 'data/characters_v4.csv';

The got_characters table was created to transform and standardize the characters data for easier analysis. We created a normalized_character column by converting the character names to lowercase and trimming any extra whitespace, ensuring consistency when joining with other datasets. The character and actor_ess fields were also converted to lowercase for uniformity. Additionally, we cast numerical fields, such as episodes_appeared, first_appearance, and last_appearance, as INTEGER to ensure the correct data type.

CREATE OR REPLACE TEMP TABLE got_characters AS
SELECT 
    LOWER(TRIM(character)) AS normalized_character,
    LOWER("actor/ess") AS actor_ess,
    CAST(episodes_appeared AS INTEGER) AS episodes_appeared,
    CAST(first_appearance AS INTEGER) AS first_appearance,
    CAST(last_appearance AS INTEGER) AS last_appearance
FROM characters;

Validation Queries

-- Check row count
SELECT COUNT(*) AS row_count
FROM got_characters;
1 records
row_count
832
-- Check data types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'got_characters';
5 records
column_name data_type is_nullable
normalized_character VARCHAR YES
actor_ess VARCHAR YES
episodes_appeared INTEGER YES
first_appearance INTEGER YES
last_appearance INTEGER YES
-- Check missing values
SELECT 
    COUNT(*) - COUNT(normalized_character) AS missing_normalized_character,
    COUNT(*) - COUNT(actor_ess) AS missing_actor_ess,
    COUNT(*) - COUNT(episodes_appeared) AS missing_episodes_appeared,
    COUNT(*) - COUNT(first_appearance) AS missing_first_appearance,
    COUNT(*) - COUNT(last_appearance) AS missing_last_appearance
FROM got_characters;
1 records
missing_normalized_character missing_actor_ess missing_episodes_appeared missing_first_appearance missing_last_appearance
13 0 0 0 0
-- Check for duplicates
SELECT 
COUNT(*) - COUNT(DISTINCT (normalized_character, actor_ess, episodes_appeared, first_appearance, last_appearance)) AS duplicate_count
FROM got_characters;
1 records
duplicate_count
0

Once the table was created, we tested it to ensure it was functioning as expected.

SELECT *
FROM got_characters;
Displaying records 1 - 10
normalized_character actor_ess episodes_appeared first_appearance last_appearance
tyrion lannister peter dinklage 67 2011 2019
cersei lannister lena headey 62 2011 2019
daenerys targaryen emilia clarke 62 2011 2019
jon snow kit harington 62 2011 2019
sansa stark sophie turner 59 2011 2019
arya stark maisie williams 59 2011 2019
jaime lannister nikolaj coster-waldau 55 2011 2019
jorah mormont iain glen 52 2011 2019
samwell tarly john bradley 48 2011 2019
theon greyjoy alfie allen 47 2011 2019

The character data is imported from the ‘characters_v4.csv’ file. We followed the same workflow as before, explicitly setting the column types to prevent pandas from guessing them. Additionally, we renamed the columns to standardize them.

character_type_spec = {
    'Character': 'string',
    'Actor/ess': 'string',
    'Episodes_appeared': 'int32',
    'First_appearance': 'int32',
    'Last_appearance': 'int32',
}

# Read the CSV file
raw_characters = pd.read_csv(
    "data/characters_v4.csv",
    dtype=character_type_spec
)

got_characters = (
    raw_characters.rename(columns = str.lower)
              .rename(columns = lambda x: x.replace( ' ', '_' ))
)

Validation Queries

# Check data types
got_characters.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 832 entries, 0 to 831
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   character          819 non-null    string
 1   actor/ess          832 non-null    string
 2   episodes_appeared  832 non-null    int32 
 3   first_appearance   832 non-null    int32 
 4   last_appearance    832 non-null    int32 
dtypes: int32(3), string(2)
memory usage: 22.9 KB
# Check missing values
got_characters.isnull().sum()
character            13
actor/ess             0
episodes_appeared     0
first_appearance      0
last_appearance       0
dtype: int64
# Check for duplicates
(got_characters.groupby('character')
  .size()
  .reset_index(name='count')
  .query('count > 1'))
                character  count
24               Baby Sam      4
30      Baratheon Soldier      3
35   Bathhouse Prostitute      2
40       Beric Dondarrion      2
55         Bolton Soldier      3
..                    ...    ...
640                 Wight      6
644              Wildling      9
647       Wildling Archer      2
651                 Willa      2
657     Winter Town Woman      2

[66 rows x 2 columns]
# There are duplicates because a general character name can appear in multiple episodes, or because a character was played by different actors.
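A minimal sketch of how such repeats can be inspected (the small frame below is made up to stand in for got_characters; the real data would be filtered the same way):

```python
import pandas as pd

# Hypothetical miniature of got_characters: generic names repeat,
# either across episodes or across different performers.
chars = pd.DataFrame({
    'character': ['Wildling', 'Wildling', 'Baby Sam', 'Jon Snow'],
    'actor/ess': ['Actor A', 'Actor B', 'Actor C', 'Kit Harington'],
})

# keep=False marks every row whose character name occurs more than once
repeated = chars[chars.duplicated(subset='character', keep=False)]
print(repeated['character'].tolist())  # ['Wildling', 'Wildling']
```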

Import a look-up table to map locations to their corresponding regions

We required a look-up table relating death locations to map regions. Because our data sources did not include such a mapping, we used Perplexity to generate one. Our process is described below:

First, we extracted the individual lists of death locations and map regions.

# Retrieve death location list
distinct( got_deaths, location )
# Retrieve map region list
got_locations |>
  select(name)
--list of death locations
SELECT DISTINCT location
FROM raw_deaths;
Displaying records 1 - 10
Location
Winterfell
Harrenhal
Qarth
Valyria
Highgarden
Beyond the Wall
Eastern Road
The Eyrie
Storm’s End
Dorne
--list of map regions
SELECT name 
FROM locations;
Displaying records 1 - 10
name
Astapor
Braavos
Dorne
Dragonstone
Dreadfort
Eastwatch
Harrenhal
King’s Landing
Last Hearth
Meereen
# Extract location names from the death data
location_list = (
    got_deaths['location']
        .drop_duplicates(keep = 'first')
        .reset_index(drop = True)
)

location_list
0       Beyond the Wall
1            Winterfell
2        King's Landing
3                Pentos
4             Kingsroad
5          Eastern Road
6             The Eyrie
7          Vaes Dothrak
8          Castle Black
9                Lhazar
10            Red Waste
11              Unknown
12          Dragonstone
13           Riverlands
14              Oxcross
15            Harrenhal
16          Storm's End
17                Qarth
18    Robb Stark's camp
19              Astapor
20             Riverrun
21            Dreadfort
22          Hollow Hill
23             The Wall
24               Yunkai
25             The Gift
26            The Twins
27              Meereen
28          Mole's Town
29          Moat Cailin
30             The Vale
31              Braavos
32                Dorne
33              Valyria
34             Hardhome
35         Iron Islands
36       The Narrow Sea
37        Casterly Rock
38           Highgarden
39             Roseroad
40          Last Hearth
Name: location, dtype: string
# Extract region names from the map location data
region_list = got_locations.filter(items = ['name'])

region_list
              name
0          Astapor
1          Braavos
2            Dorne
3      Dragonstone
4        Dreadfort
5        Eastwatch
6        Harrenhal
7   King's Landing
8      Last Hearth
9          Meereen
10     Moat Cailin
11         Oldtown
12          Pentos
13            Pyke
14           Qarth
15        Riverrun
16       The Eyrie
17       The Twins
18        The Wall
19    Vaes Dothrak
20      Winterfell
21          Yunkai

Second, we used Perplexity to generate the look-up table. In our prompt, we included the complete lists of locations and regions and specified the desired output format to ensure clarity and ease of use.

Categorize each of the following Game of Thrones locations into their respective regions and present the output in a table. If a location does not have a direct one-to-one mapping, use your best knowledge to infer its region.

#output format
| location | region |

#location
Beyond the Wall             
Winterfell              
King's Landing              
Pentos              
Kingsroad               
Eastern Road                
The Eyrie               
Vaes Dothrak                
Castle Black                
Lhazar              
Red Waste               
Unknown             
Dragonstone             
Riverlands              
Oxcross             
Harrenhal               
Storm's End             
Qarth               
Robb Stark's camp               
Astapor             
Riverrun                
Dreadfort               
Hollow Hill             
The Wall                
Yunkai              
The Gift                
The Twins               
Meereen             
Mole's Town             
Moat Cailin             
The Vale                
Braavos             
Dorne               
Valyria             
Hardhome                
Iron Islands                
The Narrow Sea              
Casterly Rock               
Highgarden              
Roseroad                
Last Hearth

#region
Astapor
Braavos
Dorne
Dragonstone
Dreadfort
Eastwatch
Harrenhal
King's Landing
Last Hearth
Meereen
Moat Cailin
Oldtown
Pentos
Pyke
Qarth
Riverrun
The Eyrie
The Twins
The Wall
Vaes Dothrak
Winterfell
Yunkai
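As a side note, the prompt body can be assembled programmatically from the two extracted lists instead of being pasted by hand; a sketch, where locs and regions are abbreviated stand-ins for the full location_list and region_list above:

```python
# Abbreviated stand-ins for the full lists extracted earlier
locs = ['Beyond the Wall', 'Winterfell', "King's Landing"]
regions = ['Astapor', 'Braavos', 'Winterfell']

# Assemble the prompt in the same format used above
prompt = (
    "Categorize each of the following Game of Thrones locations into their "
    "respective regions and present the output in a table. If a location does "
    "not have a direct one-to-one mapping, use your best knowledge to infer "
    "its region.\n\n"
    "#output format\n| location | region |\n\n"
    "#location\n" + "\n".join(locs) + "\n\n"
    "#region\n" + "\n".join(regions)
)
print(prompt)
```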

Once the table was generated, we copied it into Excel and saved it as a CSV file, ‘location_region.csv.’ We then imported the CSV file into our project, following the same workflow as the other data sets.

region_location_col_spec = cols(
  Region = col_character(),
  Location = col_character()
)

got_region_location = read_csv( "data/location_region.csv", col_types = region_location_col_spec ) |>
  janitor::clean_names()

got_region_location

Validation Queries

#Check row count
got_region_location |> 
  summarize(row_count = n())
#Check data types
got_region_location |> 
  glimpse()
Rows: 40
Columns: 2
$ location <chr> "Beyond the Wall", "Winterfell", "King's Landing", "Pentos", …
$ region   <chr> "The Wall", "Winterfell", "King's Landing", "Pentos", "The No…
#Check missing values
got_region_location |> 
  summarize(across(everything(), ~ sum(is.na(.)), .names = "missing_{.col}"))
#Check for duplicates
got_region_location |> 
  summarize(duplicate_count = sum(duplicated(got_region_location)))
CREATE OR REPLACE TEMP TABLE raw_region_location AS
SELECT * FROM 'data/location_region.csv';

We then created the got_region_location table to standardize the mapping of regions and locations. We used LOWER(TRIM()) to create normalized_region and normalized_location columns, ensuring consistency by converting the region and location names to lowercase and removing any extra whitespace.

CREATE OR REPLACE TEMP TABLE got_region_location AS
SELECT 
    LOWER(TRIM("Region")) AS normalized_region,
    LOWER(TRIM("Location")) AS normalized_location
FROM raw_region_location;

Validation Queries

-- Check row count
SELECT COUNT(*) AS row_count
FROM got_region_location;
1 records
row_count
40
-- Check data types
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'got_region_location';
2 records
column_name data_type is_nullable
normalized_region VARCHAR YES
normalized_location VARCHAR YES
-- Check missing values
SELECT 
    COUNT(*) - COUNT(normalized_region) AS missing_normalized_region,
    COUNT(*) - COUNT(normalized_location) AS missing_normalized_location
FROM got_region_location;
1 records
missing_normalized_region missing_normalized_location
0 0
-- Check for duplicates
SELECT 
COUNT(*) - COUNT(DISTINCT (normalized_region, normalized_location)) AS duplicate_count
FROM got_region_location;
1 records
duplicate_count
0

Once the table was created, we tested it to ensure it was functioning as expected.

SELECT * 
FROM got_region_location;
Displaying records 1 - 10
normalized_region normalized_location
the wall beyond the wall
winterfell winterfell
king’s landing king’s landing
pentos pentos
the north kingsroad
the vale eastern road
the eyrie the eyrie
vaes dothrak vaes dothrak
the wall castle black
essos lhazar
location_region_type_spec = {
    'Location' : 'string',
    'Region' : 'string'
}

# Read the CSV file
raw_location_region = pd.read_csv(
    "data/location_region.csv",
    dtype=location_region_type_spec
)

got_location_region = (
    raw_location_region.rename( columns=str.lower )
)

got_location_region
             location           region
0     Beyond the Wall         The Wall
1          Winterfell       Winterfell
2      King's Landing   King's Landing
3              Pentos           Pentos
4           Kingsroad        The North
5        Eastern Road         The Vale
6           The Eyrie        The Eyrie
7        Vaes Dothrak     Vaes Dothrak
8        Castle Black         The Wall
9              Lhazar            Essos
10          Red Waste            Essos
11        Dragonstone      Dragonstone
12         Riverlands       Riverlands
13            Oxcross      Westerlands
14          Harrenhal        Harrenhal
15        Storm's End       Stormlands
16              Qarth            Qarth
17  Robb Stark's camp        The North
18            Astapor          Astapor
19           Riverrun         Riverrun
20          Dreadfort        Dreadfort
21        Hollow Hill       Riverlands
22           The Wall         The Wall
23             Yunkai           Yunkai
24           The Gift        The North
25          The Twins        The Twins
26            Meereen          Meereen
27        Mole's Town        The North
28        Moat Cailin      Moat Cailin
29           The Vale         The Vale
30            Braavos          Braavos
31              Dorne            Dorne
32            Valyria            Essos
33           Hardhome  Beyond the Wall
34       Iron Islands             Pyke
35     The Narrow Sea            Essos
36      Casterly Rock      Westerlands
37         Highgarden        The Reach
38           Roseroad        The Reach
39        Last Hearth      Last Hearth

Validation Queries

# Check data types
got_location_region.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   location  40 non-null     string
 1   region    40 non-null     string
dtypes: string(2)
memory usage: 768.0 bytes
# Check missing values
got_location_region.isnull().sum()
location    0
region      0
dtype: int64
# Check for duplicates
got_location_region.duplicated().sum()
np.int64(0)
# Check for duplicate locations
(got_location_region.groupby('location')
  .size()
  .reset_index(name='count')
  .query('count > 1'))
Empty DataFrame
Columns: [location, count]
Index: []
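One additional check worth running, not shown above, is that every death location is covered by the look-up table, since an uncovered location would join to a missing region later. A sketch with toy stand-ins for got_deaths['location'] and got_location_region['location']:

```python
import pandas as pd

# Toy stand-ins for the real location columns
death_locations = pd.Series(['Winterfell', 'Qarth', 'Hollow Hill'])
lookup_locations = pd.Series(['Winterfell', 'Qarth'])

# Any death location absent from the look-up table would become NaN
# after the region join, so surface it explicitly
uncovered = set(death_locations) - set(lookup_locations)
print(uncovered)  # {'Hollow Hill'}
```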

Combine Data into One Data Frame

We addressed the issue of the many-to-many relationship between the character and death data sets. Some characters were listed using general terms, such as “Wildling” or “Stark Soldier”, and we didn’t have enough information to match them accurately. To handle this, we decided to remove these general terms. Since they could appear multiple times in the data sets, we removed the duplicate entries from both the character and death data sets.

# Exclude the characters having repeated names
got_characters <- got_characters |>
  distinct(character, .keep_all = TRUE)

got_characters
# Exclude the deaths having repeated names
got_deaths <- got_deaths |>
  distinct(name, .keep_all = TRUE)

got_deaths

Remove duplicated character names

--Exclude the characters having repeated names
CREATE OR REPLACE TEMP TABLE got_characters AS
SELECT DISTINCT ON (normalized_character) *
FROM got_characters;
SELECT *
FROM got_characters;
Displaying records 1 - 10
normalized_character actor_ess episodes_appeared first_appearance last_appearance
catelyn stark michelle fairley 26 2011 2016
roose bolton michael mcelhatton 20 2012 2016
alliser thorne owen teale 19 2011 2016
gregor ‘the mountain’ clegane hafþór júlíus björnsson 17 2014 2019
olly brenock o’connor 17 2014 2016
wun wun ian whyte 16 2011 2019
yohn royce rupert vansittart 13 2014 2019
high sparrow jonathan pryce 12 2015 2016
nymeria sand jessica henwick 8 2015 2017
marei josephine gillan 8 2012 2019

Remove duplicated victim names

--Exclude the deaths having repeated names
CREATE OR REPLACE TEMP TABLE got_deaths AS
SELECT DISTINCT ON (normalized_name) *
FROM got_deaths;
SELECT *
FROM got_deaths;
Displaying records 1 - 10
normalized_name normalized_allegiance season episode normalized_location killer killers_house method death_no
robert baratheon house baratheon of king’s landing 1 7 king’s landing Boar None Tusk 34
qotho dothraki 1 9 red waste Jorah Mormont House Targaryen Sword 51
irri house targaryen 2 6 qarth Unknown Unknown Unknown 93
hoster tully house tully 3 2 riverrun None None Illness 190
rickard karstark house stark 3 5 riverrun Robb Stark House Stark Sword 209
horse breeder night’s watch 3 9 the gift Ygritte Free Folk Arrow 223
yunkai soldier wise masters 3 9 yunkai Jorah Mormont House Targaryen Sword 228
unborn stark child house stark 3 9 the twins Lothar Frey House Frey Knife 239
robb stark house stark 3 9 the twins Roose Bolton House Bolton Knife 265
joffrey baratheon house lannister 4 2 king’s landing Olenna Tyrell House Tyrell Poison 285

Validation Queries

We ran another duplicate check on the got_deaths table to ensure there are no more duplicate values.

-- Check for duplicates
SELECT 
COUNT(*) - COUNT(DISTINCT (normalized_name, normalized_allegiance, season, episode, normalized_location, killer, killers_house, method, death_no)) AS duplicate_count
FROM got_deaths;

--There are no more duplicates after general characters are removed from the dataset
1 records
duplicate_count
0
# Exclude rows with repeated character names
got_characters = got_characters.drop_duplicates(subset = 'character', keep = 'first')
# Exclude rows with repeated names
got_deaths = got_deaths.drop_duplicates(subset = 'name', keep = 'first')

We moved on to the main process of data combination.

First, we joined the death data with the episode data based on the season and episode. Next, we used the victim’s name to join the character data, which provided the number of episodes each victim appeared in. We then dropped the irrelevant columns to ensure the data was focused on our goals. During this stage, we discovered that some victims were missing episode appearance data. Assuming each character appeared at least once, we filled in the missing values with 1. Finally, we used the look-up table to join the death location with the corresponding map region. Through this process, we created a unified data frame for further analysis and visualization.

deaths_detail_info <- got_deaths |>
  left_join( got_episodes, by = c("season", "episode")  ) |>
  left_join( got_characters, by = join_by( name == character )  ) |>
  select( season, episode, title, rating, name, killer, location, episodes_appeared ) |>
  # Assume characters without data for episodes_appeared have appeared at least once in a scene
  mutate( episodes_appeared = replace_na( episodes_appeared, 1 )) |>
  left_join( got_region_location, by = "location" )|>
  left_join( got_locations, by = join_by( region == name )  )

deaths_detail_info

Instead of creating a new identifier, we joined the death data directly to the episode data (got_episodes) on the shared season and episode fields.

Next, we joined the character data (got_characters) using the normalized_name of each victim, which provided the number of episodes each victim appeared in (episodes_appeared). We then dropped the irrelevant columns to ensure the data was focused on our goals. During this stage, we discovered that some victims were missing episode appearance data. Assuming each character appeared at least once, we filled in the missing values with 1 using COALESCE.

Following this, we used the death normalized_location to join the got_region_location table, giving us the normalized_region for each death. Finally, we joined the map coordinates (got_locations) based on the region, which allowed us to add geographical information (loc_x and loc_y) for each death event. Through this process, we created a unified data frame for further analysis and visualization.

CREATE OR REPLACE TEMP TABLE deaths_detail_info AS
SELECT 
    d.season,
    d.episode,
    r.title,
    r.rating,
    d.normalized_name,
    d.killer,
    d.normalized_location,
    COALESCE(c.episodes_appeared, 1) AS episodes_appeared,
    reg.normalized_region,
    loc.loc_x,
    loc.loc_y
FROM got_deaths d
LEFT JOIN got_episodes r ON d.season = r.season AND d.episode = r.episode
LEFT JOIN got_characters c ON d.normalized_name = c.normalized_character
LEFT JOIN got_region_location reg ON d.normalized_location = reg.normalized_location
LEFT JOIN got_locations loc ON reg.normalized_region = loc.normalized_name
ORDER BY 
    d.season,
    d.episode,
    episodes_appeared DESC;

Once the table was created, we tested it to ensure it was functioning as expected.

SELECT *
FROM deaths_detail_info;
Displaying records 1 - 10
season episode title rating normalized_name killer normalized_location episodes_appeared normalized_region loc_x loc_y
1 1 winter is coming 9.1 gared White Walker beyond the wall 1 the wall 6.770 3.822
1 1 winter is coming 9.1 will Ned Stark winterfell 1 winterfell 5.666 6.218
1 1 winter is coming 9.1 waymar royce White Walker beyond the wall 1 the wall 6.770 3.822
1 1 winter is coming 9.1 jon arryn Lysa Arryn king’s landing 1 king’s landing 6.697 12.759
1 1 winter is coming 9.1 dothraki man Dothraki man pentos 1 pentos 10.395 12.728
1 1 winter is coming 9.1 stag Direwolf winterfell 1 winterfell 5.666 6.218
1 1 winter is coming 9.1 direwolf Stag winterfell 1 winterfell 5.666 6.218
1 2 the kingsroad 8.8 catspaw assassin Summer winterfell 1 winterfell 5.666 6.218
1 2 the kingsroad 8.8 mycah Sandor “the Hound” Clegane kingsroad 1 the north NA NA
1 2 the kingsroad 8.8 lady Ned Stark kingsroad 1 the north NA NA

The pandas implementation follows the same sequence: join the death data with the episode data on season and episode, merge in the character data by victim name, keep only the relevant columns, fill missing episodes_appeared values with 1 (assuming each character appeared at least once), and finally attach the region look-up table and the map coordinates.

deaths_detail_info = (got_deaths
     # Merge episode rating data on the shared `season` and `episode` columns
     .merge(got_episodes,
            on = ['season', 'episode'],
            how='left')
                      
     # Merge character data based on the character name
     .merge(got_characters,
            left_on = 'name',
            right_on = 'character',
            how='left')
    
     # Keep only the relevant columns
     .filter(items = ['season', 'episode', 'title', 'rating', 'name', 'killer', 'location', 'episodes_appeared'])
                      
     # Fill missing episode appearance values with 1 (each character appeared at least once)
     .assign(episodes_appeared = lambda df: df['episodes_appeared'].fillna(1) )
                      
     # Merge location-region look-up table                 
     .merge(got_location_region,
            on = 'location',
            how='left')
                      
     # Merge map location data based on region                 
     .merge(got_locations,
            left_on = 'region',
            right_on = 'name',
            how='left')
     .rename(columns = {
         'name_x': 'name'
     })
    .drop(columns = ['name_y'])
)

deaths_detail_info
     season  episode             title  ...          region     fx      fy
0         1        1  Winter Is Coming  ...        The Wall  6.770   3.822
1         1        1  Winter Is Coming  ...        The Wall  6.770   3.822
2         1        1  Winter Is Coming  ...      Winterfell  5.666   6.218
3         1        1  Winter Is Coming  ...      Winterfell  5.666   6.218
4         1        1  Winter Is Coming  ...      Winterfell  5.666   6.218
..      ...      ...               ...  ...             ...    ...     ...
248       8        5         The Bells  ...  King's Landing  6.697  12.759
249       8        5         The Bells  ...  King's Landing  6.697  12.759
250       8        5         The Bells  ...  King's Landing  6.697  12.759
251       8        5         The Bells  ...  King's Landing  6.697  12.759
252       8        6   The Iron Throne  ...  King's Landing  6.697  12.759

[253 rows x 11 columns]
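A quick way to catch ambiguous or failed matches during merges like the one above is pandas' merge indicator. A minimal sketch with tiny hypothetical frames (illustrative stand-ins, not the real data sets):

```python
import pandas as pd

# Tiny hypothetical death and character tables (illustrative only)
deaths = pd.DataFrame({'name': ['Ned Stark', 'Viserys Targaryen', 'Stark Soldier'],
                       'season': [1, 1, 1]})
characters = pd.DataFrame({'character': ['Ned Stark', 'Viserys Targaryen'],
                           'episodes_appeared': [10, 5]})

# indicator=True adds a _merge column recording where each row matched
checked = deaths.merge(characters, left_on='name', right_on='character',
                       how='left', indicator=True)

# Rows flagged 'left_only' had no counterpart in the character table
unmatched = checked.loc[checked['_merge'] == 'left_only', 'name'].tolist()
print(unmatched)  # → ['Stark Soldier']
```

Inspecting the `left_only` rows is how generic names like "Stark Soldier" surface before they silently propagate NaNs downstream.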

Data Analysis and Visualization

Task1: How do the deaths of prominent characters influence the ratings?

To analyze the data, we first sorted it by season, episode, and the number of episodes each character appeared in, so that the most prominent character death comes first within each episode. We then grouped the data by season and episode for aggregation. This allowed us to calculate the death count for each episode while retaining the episode ratings for further analysis. Upon examining the aggregation results, we observed that not every episode has character deaths. To address this, we joined the aggregated data with the original episode data to include ratings for all episodes and filled the death count with zeros for episodes without any deaths. Finally, we created a scatter plot to visualize the relationship between episode ratings, death counts, and the deaths of prominent characters.

episode_summary <- deaths_detail_info |>
  group_by( season, episode ) |>
  arrange( desc(episodes_appeared), .by_group = TRUE ) |>
  summarize(
      num_deaths = n(),  
      ep_rating = first(rating),
      prominent_death_episodes_appeared = first(episodes_appeared),
      .groups = "drop"
  ) |>
  right_join( got_episodes |>
    select( season, episode, rating ),
    by = c("season", "episode")
  ) |>
 mutate(
   num_deaths = replace_na(num_deaths, 0),
   ep_rating = rating,
   prominent_death_episodes_appeared = replace_na(prominent_death_episodes_appeared, 0)
  ) |>
  select(season, episode, num_deaths, ep_rating, prominent_death_episodes_appeared) |>
  arrange( season, episode )

episode_summary
# Adding 1 ensures the logarithm is defined for episodes with zero deaths
ggplot(episode_summary, aes(x = ep_rating, y = log(num_deaths + 1))) +
  geom_point(aes(size = prominent_death_episodes_appeared), alpha = 0.8, color = "blue") + 
  scale_size_continuous(range = c(1, 8), name = "Appeared Episodes of\n Prominent Characters") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Relationship Between Episode Ratings, Death Count,\n and Prominent Character Deaths",
    x = "Episode Rating",
    y = "log(Death Count)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),
    axis.title = element_text(size = 12)
  )
`geom_smooth()` using formula = 'y ~ x'

Analysis: This scatterplot shows a positive correlation between episode ratings and the number of deaths in the show. Episodes with higher ratings tend to have more deaths, especially as ratings approach 10. The regression line has a slight upward slope, suggesting a mild positive correlation between the two variables. Although the trend is positive, the correlation appears to be weak, as the data points are widely scattered around the line. There are several episodes with both high and low death counts across varying ratings, indicating that while episodes with higher ratings may have slightly more deaths, it is not a strong or consistent pattern. Additionally, the deaths of prominent characters appear to be randomly distributed, making it difficult to identify any clear relationship between these deaths and episode ratings.

To analyze the data, we grouped it by season and episode to calculate the total number of deaths in each episode, identifying the most prominent character death in each episode by the number of episodes that character appeared in before dying. This let us compute the death count per episode and measure prominence using the maximum episodes_appeared value. Upon examining the aggregation results, we observed that not every episode has character deaths. To address this, we joined the aggregated data with the original episode data to include ratings for all episodes and filled the death count with zeros for episodes without any deaths.

WITH
  death_count AS (
    SELECT
        season,
        episode,
        COUNT(*) AS num_deaths, 
        MAX(rating) AS ep_rating, 
        MAX(episodes_appeared) AS prominent_death_episodes_appeared 
    FROM deaths_detail_info
    GROUP BY season, episode
  )
SELECT
  got_episodes.season,
  got_episodes.episode,
  got_episodes.rating,
  COALESCE(death_count.num_deaths, 0) AS num_deaths,
  COALESCE(death_count.prominent_death_episodes_appeared, 0) AS prominent_death_episodes_appeared
FROM got_episodes
LEFT JOIN death_count 
    ON got_episodes.season = death_count.season 
    AND got_episodes.episode = death_count.episode
ORDER BY got_episodes.season, got_episodes.episode;
Displaying records 1 - 10
season episode rating num_deaths prominent_death_episodes_appeared
1 1 9.1 7 1
1 2 8.8 3 1
1 3 8.7 0 0
1 4 8.8 1 1
1 5 9.1 5 5
1 6 9.2 5 5
1 7 9.2 1 7
1 8 9.0 8 6
1 9 9.6 4 6
1 10 9.5 3 10

To analyze the data, we first sorted it by season, episode, and the number of episodes each character appeared in, so that the most prominent character death comes first within each episode. We then grouped the data by season and episode for aggregation, which let us calculate the death count for each episode while retaining the episode ratings. As in the R version, we right-merged the aggregated data with the episode table and filled the death count and prominence columns with zeros for episodes without deaths. Finally, we created a scatter plot to visualize the relationship between episode ratings, death counts, and the deaths of prominent characters.

episode_summary = ( deaths_detail_info
     .sort_values(by = ['season', 'episode', 'episodes_appeared'], 
                  ascending=[True, True, False])
     .groupby(['season', 'episode'])
     .agg(
        num_deaths = ('name', 'count'),
        rating = ('rating', 'first' ),
        episodes_appeared = ( 'episodes_appeared', 'first' )
      )
      .rename(columns = {'rating': 'ep_rating'})
      .merge(
        got_episodes[['season', 'episode', 'rating']],
        on=['season', 'episode'],
        how='right'
      )
      .fillna({
        'num_deaths': 0, 
        'episodes_appeared': 0
      })
      .filter(items = ['season', 'episode', 'num_deaths', 'episodes_appeared', 'rating'])
)

episode_summary
    season  episode  num_deaths  episodes_appeared  rating
0        1        1         7.0                1.0     9.1
1        1        2         3.0                1.0     8.8
2        1        3         0.0                0.0     8.7
3        1        4         1.0                1.0     8.8
4        1        5         5.0                5.0     9.1
..     ...      ...         ...                ...     ...
68       8        2         1.0                3.0     7.9
69       8        3         8.0               52.0     7.5
70       8        4         2.0               38.0     5.5
71       8        5        11.0               62.0     6.0
72       8        6         1.0               62.0     4.1

[73 rows x 5 columns]
# Create figure
plt.figure(figsize=(12, 8))

# Scaling factor for point size
SIZE_SCALE = 10

# Create scatterplot with varying sizes
scatter = plt.scatter( x = episode_summary['rating'], 
                       y = np.log1p(episode_summary['num_deaths']),  # log(x + 1) avoids log(0) for zero-death episodes
                       s = episode_summary['episodes_appeared'] * SIZE_SCALE,
                       alpha = 0.8,
                       color = 'blue')
# Calculate the regression line
z = np.polyfit(episode_summary['rating'], np.log1p(episode_summary['num_deaths']), 1)
p = np.poly1d(z)


# Add regression line
plt.plot(episode_summary['rating'], 
         p(episode_summary['rating']), 
         color = 'blue', 
         linestyle = '-')


plt.title('Relationship Between Episode Ratings, Death Count, and Prominent Character Deaths')
plt.xlabel('Episode Rating', fontsize = 12)
plt.ylabel('log(Death Count)', fontsize = 12)

# Add legends
handles, labels = scatter.legend_elements(prop = "sizes", 
                                          alpha = 0.6,
                                          num = 5,
                                          func = lambda x: x/SIZE_SCALE)  # reverse the size scaling
plt.legend( handles, 
            labels,
            title = "Appeared Episodes of\nProminent Characters")

# Show the plot
plt.show()

Analysis: This scatterplot shows a positive correlation between episode ratings and the number of deaths in the show. Episodes with higher ratings tend to have more deaths, especially as ratings approach 10. The regression line has a slight upward slope, suggesting a mild positive correlation between the two variables. Although the trend is positive, the correlation appears to be weak, as the data points are widely scattered around the line. There are several episodes with both high and low death counts across varying ratings, indicating that while episodes with higher ratings may have slightly more deaths, it is not a strong or consistent pattern. Additionally, the deaths of prominent characters appear to be randomly distributed, making it difficult to identify any clear relationship between these deaths and episode ratings.
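The strength of this relationship can also be quantified directly with a Pearson correlation between rating and log-scaled death count. A minimal sketch using a few illustrative values, not the real episode_summary:

```python
import numpy as np

# Illustrative ratings and death counts (not the real episode_summary values)
ratings = np.array([8.7, 8.8, 9.1, 9.2, 9.6, 7.5, 6.0])
num_deaths = np.array([0, 1, 5, 5, 4, 8, 11])

# Pearson correlation between rating and log(deaths + 1)
r = np.corrcoef(ratings, np.log1p(num_deaths))[0, 1]
print(round(r, 2))  # a value near zero indicates a weak linear relationship
```

Reporting the coefficient alongside the regression line would make "weak correlation" a checkable claim rather than a visual impression.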

Task2: Which regions are targeted most often?

To identify the most frequently targeted regions, we grouped the death data by region and summarized it by counting the number of deaths in each region, keeping each region’s coordinates for visualization. We then removed any rows with missing coordinates. These gaps most likely came from the AI-generated look-up table, which occasionally assigned regions outside the provided list (possibly more accurately, drawing on its own knowledge of the setting); since those rows could not be placed on the map, we dropped them.

death_location <- deaths_detail_info |>
  group_by( region ) |>
  summarize( death_count = n(),
             fx = first(fx),
             fy = first(fy)) |>
  drop_na(fx, fy)

death_location

To better understand the results, we visualized the data points on the map using a scatterplot. The map serves as the background, with each point representing a victim’s location. The size of the dots allows viewers to easily identify patterns in victim numbers across different regions.

# Read map image
map_raw_img <- image_read( got_map_url ) |>
  image_convert( format = "png" )

# Create a raster map with brightness adjustment
raster_map_img <- map_raw_img |>
  image_modulate( brightness = 120) |>
  as.raster()

# Calculate the image dimension
info <- image_info(map_raw_img)
IMAGE_DPI <- 96 # 96 dpi
width_in <- info$width / IMAGE_DPI
height_in <- info$height / IMAGE_DPI

# Plot the death count at their locations on the map
ggplot( death_location, aes( x = fx, y = height_in - fy ) ) +
  annotation_raster(raster_map_img, xmin = 0, xmax = width_in, ymin = 0, ymax = height_in) +
  geom_point( aes( size = death_count, color = death_count )) +
  scale_color_gradient(low = "#FFB7B2",
                       high = "#8A3A33") +
  scale_x_continuous(limits = c(0, width_in )) +
  scale_y_continuous(limits = c( 0, height_in)) +
  scale_size(range = c(2, 10)) +
  labs(title = "Targeted Region Across Ice and Fire World", x = "", y = "") +
  theme(legend.position = "none") +
  coord_fixed(ratio = 1)

Analysis: This map highlights regions with the highest death counts across the Game of Thrones world, with the largest concentration of deaths occurring in the central area of Westeros, specifically around King’s Landing. This distribution suggests that major conflicts and battles are concentrated in and around key political centers.

To identify the most frequently targeted regions, we grouped the death data by region and summarized it by counting the number of deaths in each region, keeping each region’s coordinates for further analysis. We then removed any rows with missing coordinates. These gaps most likely came from the AI-generated look-up table, which occasionally assigned regions outside the provided list (possibly more accurately, drawing on its own knowledge of the setting); since those rows could not be placed on the map, we dropped them.

SELECT
    normalized_region AS region,
    COUNT(*) AS death_count,       
    MAX(loc_x) AS fx,             
    MAX(loc_y) AS fy             
FROM deaths_detail_info
WHERE loc_x IS NOT NULL AND loc_y IS NOT NULL  
GROUP BY region;
Displaying records 1 - 10
region death_count fx fy
riverrun 4 5.051 11.061000
pyke 1 3.229 10.572000
last hearth 1 7.025 4.911000
vaes dothrak 3 23.726 11.686000
dragonstone 6 8.311 11.967000
astapor 4 19.196 17.341999
moat cailin 1 5.947 8.145001
braavos 2 10.342 9.363000
winterfell 31 5.666 6.218000
king’s landing 53 6.697 12.759000

To identify the most frequently targeted regions, we grouped the death data by region and aggregated it by counting the number of deaths in each region, keeping each region’s coordinates for visualization. We then removed any rows with missing coordinates. These gaps most likely came from the AI-generated look-up table, which occasionally assigned regions outside the provided list (possibly more accurately, drawing on its own knowledge of the setting); since those rows could not be placed on the map, we dropped them.

death_location = ( deaths_detail_info
     .groupby('region')
     .agg(
        death_count=('region', 'count'),
        loc_x=('fx', 'first'),
        loc_y=('fy', 'first') )
     .dropna(subset=['loc_x', 'loc_y'])
)

death_location
                death_count   loc_x   loc_y
region                                     
Astapor                   4  19.196  17.342
Braavos                   2  10.342   9.363
Dorne                     9   8.416  16.529
Dragonstone               6   8.311  11.967
Dreadfort                 3   7.280   5.999
Harrenhal                 3   6.218  11.519
King's Landing           54   6.697  12.759
Last Hearth               1   7.025   4.911
Meereen                  18  19.935  15.592
Moat Cailin               1   5.947   8.145
Pentos                    1  10.395  12.728
Pyke                      1   3.229  10.572
Qarth                     8  25.580  18.685
Riverrun                  4   5.051  11.061
The Eyrie                 2   7.030  10.426
The Twins                10   5.260   9.811
The Wall                 40   6.770   3.822
Vaes Dothrak              3  23.726  11.686
Winterfell               32   5.666   6.218
Yunkai                    3  19.550  16.102

To better understand the results, we visualized the data points on the map using a scatterplot. The map serves as the background, with each point representing a victim’s location. The size of the dots allows viewers to easily identify patterns in victim numbers across different regions.

# Read map image
response = requests.get(got_map_url[0])
map_raw_img = Image.open(BytesIO(response.content))
map_raw_img
<PIL.PngImagePlugin.PngImageFile image mode=RGB size=3600x2400 at 0x7F1CDE197A90>

# Create a raster map with brightness adjustment
enhancer = ImageEnhance.Brightness(map_raw_img)
raster_map_img = enhancer.enhance(1.2)

# Calculate image dimensions
IMAGE_DPI = 96
width_in = map_raw_img.size[0] / IMAGE_DPI
height_in = map_raw_img.size[1] / IMAGE_DPI

# Create the plot
plt.figure(figsize=(width_in, height_in), dpi=IMAGE_DPI)

# Add the map as background
plt.imshow(raster_map_img, extent=[0, width_in, 0, height_in])

# Plot the points
scatter = plt.scatter(
    death_location['loc_x'], 
    height_in - death_location['loc_y'], # flip Y axis to align coordinate
    s = death_location['death_count'] * 100,  # adjust point size
    c = death_location['death_count'],
    alpha = 0.9
)

Analysis: This map highlights regions with the highest death counts across the Game of Thrones world, with the largest concentration of deaths occurring in the central area of Westeros, specifically around King’s Landing. This distribution suggests that major conflicts and battles are concentrated in and around key political centers.

Challenges

  • Handling Many-to-Many Mappings

    While processing the data, we encountered issues when merging the death and character datasets: duplicate names triggered a many-to-many mapping warning. The duplicates arose because the character data included non-individual names, such as Wildling, which represent a collective group rather than a specific character. After consulting with Dr. Howison, we removed the duplicated names from the dataset, which resolved the many-to-many conflict and allowed a cleaner join between the datasets.
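    This fix can be sketched in pandas, whose merge() accepts a validate argument that raises an error when a join is not the expected shape. The frames below are tiny hypothetical stand-ins for our real data sets:

```python
import pandas as pd

# Tiny hypothetical tables: the collective name "Wildling" appears twice
characters = pd.DataFrame({'character': ['Jon Snow', 'Wildling', 'Wildling'],
                           'episodes_appeared': [62, 1, 2]})
deaths = pd.DataFrame({'name': ['Jon Snow', 'Wildling']})

# With duplicated keys on the right side, strict validation raises MergeError
try:
    deaths.merge(characters, left_on='name', right_on='character',
                 how='left', validate='many_to_one')
except pd.errors.MergeError:
    print('duplicate keys detected')

# Removing all duplicated names restores a clean many-to-one join;
# the ambiguous 'Wildling' row simply stays unmatched (NaN) afterwards
deduped = characters[~characters['character'].duplicated(keep=False)]
merged = deaths.merge(deduped, left_on='name', right_on='character',
                      how='left', validate='many_to_one')
print(len(merged))  # → 2
```

Using validate turns the silent row-multiplication problem into an explicit error at the point of the join.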

  • Correcting Coordinate Alignment on the Map

    While creating a scatterplot of death count points on the map, we noticed that some points were located in the sea rather than on land. To diagnose the issue, we examined the map location data to understand how the coordinates were defined. We discovered that the coordinates followed a standard 2D screen coordinate system, in which the origin is at the top-left corner, and that the map’s notes specified a scale of 96 DPI. With this information, we transformed our coordinates to align with the data’s coordinate system, which allowed us to plot the points correctly on the map.
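    The transformation amounts to a single subtraction against the map height. A minimal sketch, assuming the 3600×2400-pixel map and 96 DPI used in the Python section:

```python
# Map coordinates use a screen-style origin (top-left, y increasing downward),
# while the plotting libraries put the origin at the bottom-left.
IMAGE_DPI = 96
height_px = 2400                   # map image height in pixels
height_in = height_px / IMAGE_DPI  # 25.0 inches at 96 dpi

def flip_y(fy):
    """Convert a top-left-origin y (in inches) to a bottom-left-origin y."""
    return height_in - fy

print(flip_y(12.759))  # King's Landing: 25.0 - 12.759 = 12.241
```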

  • Bridging the Gap Between Data Sets

    We encountered a challenge in joining the death data and map location data due to a lack of sufficient information. After searching online for a while, we realized that acquiring this data might not be possible. We then considered leveraging the power of AI to bridge the gap. While using AI introduced some uncertainty in the information, we manually examined the results. After several checks, we determined that the AI-generated data could serve as a useful source to help connect the data sets. Given that there were 41 rows involved, this approach reduced the effort needed to assign and search for data manually.

  • Loading JSON data into DuckDB

    Another challenge we faced was loading JSON data into DuckDB while working on the SQL portion. Initially, we encountered errors when trying to use dbWriteTable() with data loaded via fromJSON(), and we found that this was due to the presence of nested structures. The fromJSON() function often returns a list with hierarchical data, while dbWriteTable() requires a flat data frame, which DuckDB cannot handle directly if the input contains nested lists. We solved this by flattening the JSON data into a data frame, converting all nested elements into a tabular structure. Once flattened, the data was compatible with dbWriteTable(), allowing us to write it to DuckDB successfully.

    From Dr. Howison’s feedback on using native DuckDB reading methods, we then revised our approach. Instead of relying on external R functions like fromJSON() and dbWriteTable(), we utilized DuckDB’s native read_json_auto function to directly load the JSON file into a DuckDB table. This method handled the hierarchical JSON structures automatically, eliminating the need for manual flattening. Additionally, we used the UNNEST function to process the nested locations field, creating individual rows for each object in the array. By extracting specific fields with the ->> operator, we transformed the data into a usable table named locations.
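    The manual flattening step can be illustrated with Python's standard json module; the nested structure below is a hypothetical stand-in for the real locations file:

```python
import json

# Hypothetical nested JSON resembling the structure of the locations file
raw = json.loads("""
{"locations": [
    {"name": "The Wall",   "fx": 6.770, "fy": 3.822},
    {"name": "Winterfell", "fx": 5.666, "fy": 6.218}
]}
""")

# Flatten the nested array into one flat row per location -- the manual
# equivalent of what DuckDB's UNNEST does on the locations field
rows = [{'name': loc['name'], 'fx': loc['fx'], 'fy': loc['fy']}
        for loc in raw['locations']]
print(rows[0]['name'])  # → The Wall
```

Each dict in `rows` maps onto one row of a flat table, which is exactly the shape dbWriteTable() (or a native DuckDB load) expects.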

  • Data Consistency and Matching During Joins

    While working on the SQL portion, we found that the region column was empty after joining, indicating that the location values did not match between tables. SQL joins are case-sensitive by default, so capitalization differences and extra spaces in the location column caused the join to fail. We solved this by standardizing the values before performing the join, using LOWER() to convert all text to lowercase and TRIM() to remove extra spaces. This ensured that values from both tables matched regardless of case or formatting inconsistencies, allowing the join to succeed and populate the region column correctly.
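    The same normalization is easy to mirror outside SQL. A minimal sketch with hypothetical location strings:

```python
# Hypothetical location values differing only in case and trailing whitespace
left_keys = ["King's Landing ", "WINTERFELL"]
right_keys = ["king's landing", "winterfell"]

def normalize(value):
    """Mirror SQL's LOWER(TRIM(...)): strip whitespace, then lowercase."""
    return value.strip().lower()

matches = [normalize(a) == normalize(b) for a, b in zip(left_keys, right_keys)]
print(matches)  # → [True, True]
```

Applying the same normalization to both join keys before merging is the pandas analogue of wrapping both columns in LOWER(TRIM(...)).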

Takeaways

  • Comparison of different methods

    Having learned three different methods (R, SQL, Python) in class for processing and visualizing data, we found it helpful to use all three in this project to better understand their strengths. R was particularly effective for data visualization and statistical analysis: dplyr for data manipulation and ggplot2 for detailed visualizations made it a strong tool for exploratory analysis. SQL was highly efficient for data processing, including extraction, aggregation, and joining; its ability to handle large datasets directly within the database made it well suited to the initial stages of data preparation and summary. Python struck a balance between the two, offering powerful data manipulation along with easy integration of libraries such as pandas for wrangling and matplotlib for visualization. We found that this flexibility made Python particularly effective for combining data transformation with in-depth analysis and visualization. By using SQL for data preparation, R for visualization, and Python for flexible processing, we leveraged each tool’s strengths to build a more efficient and comprehensive data analysis workflow.

  • Planning Before Starting the Project

    Another takeaway was the importance of planning before starting the project. Initially, we underestimated the time and effort needed to clean and transform the data. We faced issues due to insufficient planning for data preparation steps, specifically not combining the datasets into one cohesive dataset when working on the R portion. Instead, we pulled data from individual tables for our analysis, which led to inefficiencies in our workflow. This approach eventually required us to revisit and redo parts of the R analysis, combining all datasets into one while simultaneously working on the SQL and Python portions of the project. This experience taught us the value of thorough planning, especially for data integration, to ensure all relevant datasets are properly prepared from the beginning. Moving forward, we learned that spending more time on the initial planning and data preparation phase will allow for smoother transitions between different analysis tools and more effective in-depth analysis.

  • Importance of Carrying Out Validation Queries

    After our meeting with Dr. Howison, we realized the importance of carrying out validation queries in the project, especially when using multiple methods (R, SQL, and Python) to process and analyze data, as well as when combining datasets to ensure data integrity. While merging tables and cleaning data, running validation checks helped catch inconsistencies early. For example, it was important to check row counts to ensure no unintended data loss occurred, verify data types to maintain consistency across fields, and handle missing values appropriately to prevent gaps in our analysis. Additionally, checking for duplicates ensured that there were no unintended duplicate records, which could impact the accuracy of our results. These validation steps ensured that key metrics, such as death counts or character details, were accurately maintained across datasets. We learned that performing these checks was crucial in ensuring that our subsequent analyses were built on a solid, reliable foundation.
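    As an illustration, the kinds of checks described above can be written as plain assertions. The frame below is a tiny hypothetical stand-in for our merged data:

```python
import pandas as pd

# Tiny hypothetical stand-in for the merged deaths_detail_info frame
df = pd.DataFrame({'season': [1, 1, 1],
                   'episode': [1, 1, 2],
                   'name': ['Will', 'Waymar Royce', 'Jon Arryn'],
                   'rating': [9.1, 9.1, None]})

# Row count: confirm no unintended loss relative to the source death table
assert len(df) == 3

# Duplicates: each (season, episode, name) combination should appear once
assert not df.duplicated(subset=['season', 'episode', 'name']).any()

# Missing values: count gaps that must be handled before analysis
missing_ratings = int(df['rating'].isna().sum())
print(missing_ratings)  # → 1
```

Running such assertions after every merge makes data loss or duplication fail loudly instead of surfacing later as wrong death counts.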